feat(scaler): add observability (metrics + tracing) to the external scaler#1634
feat(scaler): add observability (metrics + tracing) to the external scaler#1634Fedosin wants to merge 1 commit into
Conversation
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse. |
b55975d to
6121c89
Compare
There was a problem hiding this comment.
Pull request overview
This PR adds OpenTelemetry-based observability to the external scaler by introducing metric instruments/exporters and distributed tracing, plus wiring them into the scaler’s startup and queue polling logic.
Changes:
- Added an OTEL metrics provider + instruments for queue pinger fetch duration/errors and endpoint count, with Prometheus and optional OTLP/HTTP export.
- Added OTEL tracing SDK setup and enabled automatic gRPC server span instrumentation via
otelgrpcwhen tracing is enabled. - Wired metrics/tracing configuration into scaler config and main startup, including a
/metricsHTTP endpoint.
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| scaler/tracing/tracing.go | Adds OTEL tracing SDK setup and exporter selection for the scaler. |
| scaler/metrics/provider.go | Introduces an OTEL MeterProvider with Prometheus and optional OTLP metric export. |
| scaler/metrics/instruments.go | Defines metric instruments and recording helpers for queue pinger metrics. |
| scaler/queue_pinger.go | Records pinger fetch metrics on each polling cycle. |
| scaler/queue_pinger_test.go | Updates pinger construction in tests for the new instruments parameter. |
| scaler/main.go | Initializes metrics/tracing, adds Prometheus /metrics server, and instruments gRPC when enabled. |
| scaler/config.go | Adds env-configurable metrics and tracing settings for the scaler. |
| go.mod | Adds otelgrpc dependency for gRPC tracing instrumentation. |
| go.sum | Updates sums for added/updated dependencies. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
744e536 to
a4f8dbb
Compare
linkvt
left a comment
There was a problem hiding this comment.
I left a few comments, we can probably reduce the size quite a bit after deduplicating code we already have in the interceptor.
The PR description should also probably not say "Fixes #..." to avoid auto closing the issue.
We should also keep in mind to update the helm chart and the resources in config/ in this repo.
| if err != nil { | ||
| setupLog.Error(err, "Kubernetes client config not found") | ||
| os.Exit(1) | ||
| runtime.Goexit() |
There was a problem hiding this comment.
Is there a reason for using runtime.Goexit now? If this is intended we should probably also add the defer os.Exit(1) at top as in the interceptor to also stop the grpc server etc after runtime.Goexit has been called.
There was a problem hiding this comment.
Fixed — added defer os.Exit(1) at the top of main (same pattern as the interceptor). All subsequent failures use runtime.Goexit() to ensure defers run.
| AttrNamespace = "namespace" | ||
| AttrService = "service" |
| @@ -0,0 +1,91 @@ | |||
| package tracing | |||
There was a problem hiding this comment.
This looks like a copy of interceptor/tracing/tracing.go, we should deduplicate this code
There was a problem hiding this comment.
Deduplicated — extracted shared tracing setup into pkg/observability/tracing.go. Both interceptor and scaler now delegate to it.
| @@ -0,0 +1,51 @@ | |||
| package metrics | |||
There was a problem hiding this comment.
This looks like a copy of interceptor/provider/metrics.go, we should deduplicate this code.
There was a problem hiding this comment.
Deduplicated — extracted shared meter provider factory into pkg/observability/metrics.go. Both interceptor and scaler delegate to it with their respective service names.
| perPod, err := fetchCountsPerPod(ctx, q.lggr, q.getEndpointsFn, q.interceptorNS, q.interceptorSvcName, q.adminPort) | ||
| fetchStart := time.Now() | ||
| result, err := fetchCountsPerPod(ctx, q.lggr, q.getEndpointsFn, q.interceptorNS, q.interceptorSvcName, q.adminPort) | ||
| if q.instruments != nil { |
There was a problem hiding this comment.
We could use the same pattern as in the interceptor instruments with NewNoopInstruments() and pass that into the components? This makes the code cleaner as we avoid adding a special case for instruments being nil.
There was a problem hiding this comment.
Done — added NewNoopInstruments() and all tests now use it instead of nil. The nil check in fetchAndSaveCounts is removed.
| meterName = "keda-external-scaler" | ||
|
|
||
| // ServiceName is the OTEL service.name used for both metrics and tracing. | ||
| ServiceName = "keda-http-external-scaler" |
There was a problem hiding this comment.
We should probably also align the service name of the interceptor to use this naming scheme
There was a problem hiding this comment.
Good point. The interceptor already uses keda-http-interceptor as its service name (set in interceptor/tracing/tracing.go). I think that's consistent with the scaler's keda-http-external-scaler. Should we rename the interceptor to something like keda-http-interceptor-proxy or is keda-http-interceptor fine?
| ProfilingAddr string `env:"PROFILING_BIND_ADDRESS" envDefault:""` | ||
| // StreamIntervalMS is the interval in milliseconds between stream ticks | ||
| StreamIntervalMS int `env:"KEDA_HTTP_SCALER_STREAM_INTERVAL_MS" envDefault:"200"` | ||
|
|
There was a problem hiding this comment.
I guess we could also deduplicate the config to ensure it is consistent across components?
There was a problem hiding this comment.
Done — extracted MetricsConfig and TracingConfig into pkg/observability/config.go. Both interceptor and scaler use type aliases to it.
|
|
||
| type metricsConfig struct { | ||
| OtelPrometheusExporterEnabled bool `env:"OTEL_PROM_EXPORTER_ENABLED" envDefault:"true"` | ||
| OtelPrometheusExporterPort int `env:"OTEL_PROM_EXPORTER_PORT" envDefault:"2224"` |
There was a problem hiding this comment.
Is there a reason to not use the same port we use for the interceptor metrics?
There was a problem hiding this comment.
Changed to 2223 (same as the interceptor).
|
|
||
| // RecordFetch records a completed pinger fetch cycle. | ||
| func (i *Instruments) RecordFetch(duration time.Duration, endpointCount int, fetchErr error) { | ||
| attrs := api.WithAttributeSet(attribute.NewSet()) |
There was a problem hiding this comment.
Can be removed as there are no attributes
| @@ -0,0 +1,81 @@ | |||
| package metrics | |||
There was a problem hiding this comment.
We should probably also add a test like the prometheus_test.go the interceptor has right now?
There was a problem hiding this comment.
Added scaler/metrics/prometheus_test.go that verifies the histogram, counter, and gauge are correctly emitted.
a4f8dbb to
f8a351b
Compare
…caler Add OpenTelemetry-based metrics and distributed tracing to the external scaler component, which previously had no observability instrumentation. Shared observability infrastructure is extracted into pkg/observability/ so both the interceptor and scaler (and future components) reuse the same tracing setup, metrics provider, and configuration types. Metrics: - scaler.pinger.fetch.duration (histogram) - scaler.pinger.fetch.errors (counter) - scaler.pinger.endpoints (gauge) - Prometheus /metrics endpoint on port 2223 (configurable) - Optional OTLP HTTP metrics export Tracing: - OTEL tracing SDK with console, HTTP/protobuf, and gRPC exporters - otelgrpc stats handler for automatic gRPC server span instrumentation - W3C TraceContext + Baggage propagation Relates to: kedacore#965 Signed-off-by: Mikhail Fedosin <mfedosin@redhat.com>
f8a351b to
9b6abde
Compare
|
Thanks for the detailed review @linkvt — addressed all inline comments in the latest force-push (
On your note about Helm/config resources: this repo currently has |
Add OpenTelemetry-based metrics and distributed tracing to the external
scaler component, which previously had no observability instrumentation.
Shared observability infrastructure is extracted into
pkg/observability/so both the interceptor and scaler reuse the same tracing setup, metrics
provider, and configuration types.
Metrics:
scaler.pinger.fetch.duration(histogram) — duration of each queue pinger fetch cyclescaler.pinger.fetch.errors(counter) — total failed pinger fetch cyclesscaler.pinger.endpoints(gauge) — number of interceptor endpoints being polled/metricsendpoint on port 2223 (configurable viaOTEL_PROM_EXPORTER_PORT)OTEL_EXPORTER_OTLP_METRICS_ENABLED)Tracing:
otelgrpcstats handler for automatic gRPC server span instrumentationConfiguration env vars (same as the interceptor):
OTEL_PROM_EXPORTER_ENABLED(default:true)OTEL_PROM_EXPORTER_PORT(default:2223)OTEL_EXPORTER_OTLP_METRICS_ENABLED(default:false)OTEL_EXPORTER_OTLP_TRACES_ENABLED(default:false)OTEL_EXPORTER_OTLP_TRACES_PROTOCOL(default:console)Checklist
README.mdPart of #965